within the document; determining a text size of the cells;
selecting some of the cells using the text size of the cells;
extracting in a text only output a text content of the selected
cells; whereby the text only output extracted can
be used to produce a summary of a portion of text of the

WO 02/33584 A1



Published:
- with international search report

For two-letter codes and other abbreviations, refer to the "Guidance
Notes on Codes and Abbreviations" appearing at the beginning
of each regular issue of the PCT Gazette.



PCT/CA00/01225

TEXT EXTRACTION METHOD FOR HTML PAGES 



Field of the Invention 

The invention relates to the field of extracting the contents of documents,
especially the contents of web pages.

Background of the Invention 

Because of the incredible quantity of documents available on the
Internet, people surfing on the Internet often have the impression that they will
not be able to find what they are looking for in a timely fashion. When search
tools return a list of hits for particular keywords which comprises more than 15
hits, it is inefficient for a user to follow each link and read through the material
available on the web site before deciding if the hit is relevant.

Summarizing tools have been created which try to extract the particular
meaning of the contents of documents using statistical analysis of the words to
better direct the users through the documents available. These summarizing
tools are very efficient with conventional documents such as papers, essays,
books, etc., but yield very limited results when used with web pages because of
the presence of banners, links, tables, frames and other presentation and
display tools which separate and organize portions of text.

Many text summarizing tools are available on the market. A few such
tools are the ConText tool by Oracle, the Text Extractor by National Research
Council of Canada (NRC), the Summarizer SDK by Inxight and the Word
AutoSummarize feature by Microsoft. Also available is the text-only save option
in Internet Explorer 5.0 by Microsoft, which allows a user to save a document
without the HTML formatting.

NRC Extractor takes a text file as input and generates a list of keywords
and keyphrases as output. The output keyphrases are intended to serve as a
short summary of the input text file. Extractor uses a statistical approach to
summarizing. Using this approach, the frequency of appearance of words and
their derivatives (stems) together with their relative position with respect to the
top of the page, among others, are important factors. Extractor uses 12
statistical parameters. As can be understood from this description of Extractor,
when such an algorithm is faced with a web page to be summarized, the
summary is polluted with many words and phrases irrelevant to the contents of
the page but highly relevant to the navigation on the site.

Referring to FIG. 1, a web page including a news article is shown. This
web page was available on October 17, 2000 at
www.zdnet.com/zdnn/stories/news/0,4586,2619342,00.html. The contents of
the web page are diluted by words such as ZDNet, Page One, Business,
Internet, Contact Us, Breaking news, etc. These words, which are irrelevant to
the contents of the news item but highly relevant to the web site, are frequent
and often appear above the text of the article.

FIG. 1 is a schematic representation of the web page mentioned above.
The contents of the web page have been divided into tables to highlight the
structure of the document. The browser 19 displays the web page. The
following is a description of the contents of each table identified in the web
page:

20. ZDNet navigation hyperlinks: Cameras, Reviews, Shop, Business, Help,
News, Electronics, GameSpot, Tech Life, Downloads, Developer.

21. The ZDNet banner with their logo.

22. ZDNet's highlighted hyperlinks: Tech Business insider, Outlet Store
Savings, Free Downloads.

23. The hierarchical position of the article: ZDNet > ZDNet News Page One >
Business > Lane gets new job, blasts Ellison.

24. An ad banner, in this case, MasterCard™.

25. A Search For tool.

26. The ZDNet Business section logo together with the Wall Street Journal logo.

27. The Sections frame.

28. The Breaking news frame with a sample of 5 news items.

29. The hyperlinks for the following news sections: Page One, Business,
Commentary, Computing, eCrime, Law and You, International, Internet,
Investor, Mac/Apple, TalkBack Central.

30. The top stories hyperlinks with a sample of 6 news items.

31. The hyperlinks to communicate with ZDNet: Contact Us, Corrections,
Custom News.

32. The operations section: E-mail this, Print this, Save this.

33. A hyperlink to the Air Tech news radio.

34. An ad frame.

35. Related Sites hyperlinks such as AnchorDesk, Inter@ctive Week, MSNBC
News, eWEEK, Sm@rt Partner, ZDNet Asia, etc.

36. The main body and contents of the news item, a news article.

37. The second portion of the main body and contents of the news item.

38. A table of hyperlinks to other related sites.

39. A hyperlink to the tool to submit comments on the news item.

40. Hyperlinks to more articles on the same story.

41. ORCL links: News, Profile, Chart, Estimates.

42. Short summary of the news article.

Not shown are other hyperlinks to ads, related articles and related web 
sites located at the bottom of the web page and accessible by scrolling the page 
using the browser's tools. 

Microsoft Internet Explorer 5.0 allows a user to save a web page as text 
only. This text-only save option extracts all text from the page, even text in 
hyperlinks. 

Table 1 shows a text-only version of the web page of Fig. 1 obtained 
using the text-only save of Microsoft Internet Explorer 5.0. 

Table 1. Text-only version of the web page of FIG. 1

ZDNet: News: Lane gets new job, blasts Ellison | Cameras | Reviews |
Shop |

Business | Help | News | Electronics | GameSpot | Tech Life | Downloads |
Developer

IPO News And Analysis
Outlet Store Savings
Free Downloads

ZDNet > ZDNet News Page One > Business > Lane gets new job, blasts
Ellison

Search For:NewsAll ZDNetThe Web Search, Tips, Power Search

Page One, Business, Commentary
Computing, eCrime, Law & You, International, Internet, Investor,
Mac/Apple, TalkBack Central

Headline Scan, News Briefs, News Archive, News Specials
Contact us, Corrections, Custom News
On the Air, Tech news, 24 hours a day, Play Radio
Related Sites, AnchorDesk, Interactive
Week, MSNBC News, eWEEK, Sm@rt Partner

ZDNet Asia, ZDNet UK, ZDNet Australia, ZDNet France, ZDNet Germany,
ZDNet Japan, ZDNet China

Lane gets new job, blasts Ellison

Former top lieutenant Ray Lane and Oracle CEO Larry Ellison continue
to battle, even as Lane takes a job with Kleiner Perkins.
By Lee Gomes, WSJ Interactive Edition
August 24, 2000 7:51 AM PT

Ray Lane, former No. 2 executive at Oracle Corp., hardly has a bad
thing to say about his former employer -- except that it is a company
full of yes men who tend to be less than candid about their products.
Lane abruptly left the business-software giant in June after an
eight-year stint. One reason was that his responsibilities as president and
chief operating officer had been reduced by Lawrence Ellison, Oracle's
(Nasdaq: ORCL) chief executive. Lane, 53 years old, said following his
departure that he wanted to devote more time to his two young children
by his second marriage.

Sound off here!! 1 , Post your comment 

Ellison vs . Lane 

ZDNet Smart Business Magazine 

Coop's Corner: Larry Ellison and Basura-gate 
Ellison changes his account of Lane departure 
Behind Lane's resignation at Oracle 
Oracle's Ray Lane steps down 
ORCL:News, Profile, Chart, Estimates 

Wednesday, Lane announced that he will become a general partner at 
Kleiner Perkins Caufield & Byers, the prominent Silicon Valley venture- 
capital firm. 

And in an interview scheduled with that announcement, Lane harshly 
criticized Ellison, making clear that his departure from Oracle wasn't 
amicable. In response to Lane's comments, Ellison strongly defended 
himself and the company. 
A great admirer yet 

Lane said he remains a great admirer of Oracle and Ellison. He said, 
for example, that Ellison's oversight of the main Oracle database 
product in the early 1990s "saved" the company, and that lately, 
Ellison has "reinvigorated" Oracle to take advantage of the 
opportunities presented by the Internet. That work made Lane's net 
worth, based largely in Oracle stock, soar to nearly a billion dollars. 
But Lane also said that Ellison is utterly dominating the company right 
now, something that might prove to be harmful in the long run, since 
Oracle won't be able to develop the strong management team it needs. 
'[The Oracle executives] aren't leaders. They just do what Larry says.
They wouldn't know how to make a decision without Larry making it for
them.' Ray Lane, former No. 2 executive at Oracle

"It's just like with kids," Lane said. "If you make all their decisions
for them, they will go out as adults not knowing how to make decisions
themselves." The executives now reporting to Ellison, said Lane, "are
not decision makers. They aren't leaders. They just do what Larry says.
They wouldn't know how to make a decision without Larry making it for
them."

Lane came to Oracle, of Redwood Shores, Calif., in 1992 at a time when
the company's credibility in the market was low. He said Wednesday that
studies he commissioned at that time found that many customers "would
never do business again with a Larry Ellison company."
The reason, Lane said, is that Oracle would sell products it didn't
have. "Larry is a visionary, and expresses the vision so well that
people believe it's a product." When he first got to Oracle, Lane said,
"managers would be willing to take the order and make a lot of money,"
even though the products often didn't exist. "That's the discipline I
put into the company," he said. "I told the sales force, 'After what
Larry says is the vision, tell the customer the truth about what we can
actually deliver.'"

'Needs more balance' 

Lane indicated that he is worried that with him gone, Oracle might 
lapse back to its old ways. "The company needs more balance," he said. 
Ellison rejected his former deputy's criticisms. 
Oracle's managers, Ellison said, were in many cases chosen by Lane
himself. "He is criticizing his own team for being weak. When did they
become yes men? I am thrilled they are all here. They are delivering
exceptional results."

Ellison also said the company doesn't sell products it doesn't have.
"He is the soul, the conscience of Oracle, and the other 45,000 of us
are criminals?" Ellison asked. "It's astounding. We don't sell products
that don't exist because it's against the law."

Even while he was at Oracle, Lane was sometimes outspoken on the
subject of Ellison. Once, for example, he described how top executives
of Boeing Corp. were no longer dealing with Oracle about an important
"business-to-business" contract because they were angry that Ellison
had publicly stated, incorrectly, that Oracle had won the deal.
Front Page, Tech Center, Money and Investing, Subscribe to wsj.com

And his latest comments about Oracle should be viewed in the context of
his new job. At Kleiner Perkins, he will be helping start-up companies
in business-to-business software and services, some of which may
potentially compete with Oracle.

Lane said he was attracted to the venture-capital job in large part
because it will mean less travel. "When you are spending 70 percent of
your time on airplanes, you have to step back and say, 'Why am I doing
this?'" He also predicted a looming shakeout at many Internet
companies, which will make his sort of operational experience even more
valuable, since he will be able to provide guidance to the surviving
companies.

Lane was originally slated to stay on Oracle's board following his 
departure. He said Wednesday, though, that he might leave it in the 
fall, when his term expires. 

More stories on: Ellison vs. Lane 
See also: Business section 
Talkback : 

Ellison claims "We don't sell p 
Sounds like Gates , Jobs and any 
The answer to Ellison's rhetori 
Let me be the first to say that 
I find that throughout life tha 
Les -> Nah. . . It's all Sun's f. 
Les : I really didn't start . . . 
Les Claypool, you forgot about 
Did you ever notice its the com 
Anyone who believes Larry Ellis 
Mr. Ellison is the bad guy. . . . 
Always research the company beh 
Mark, actually I noticed compan 
Did you ever notice how similar 
05:46a NEC sets sail with Transme

- Daniel Welch

- de

- john major

- John Bannon

- Dave Rothgery

05:46a Excite@Home offers do-it-yourself cable

05:39a Madonna gives cybersquatter the boot

04:44a Investor AM: Catalyst wanted to spur tech stocks

04:28a AMD ships 1.2GHz Athlons

More...

AOL wireless: No training wheels?
EFF defends nameless Netizens

Open-source angst: Fear of forking

When a text summarizer such as the NRC Extractor is used on a text-only
version of a web page, the results are less than satisfying, as can be seen
from the following keywords and keyphrases extracted by the NRC Extractor
from the text-only version of Table 1.

Keyphrases:

Lane, Ellison, Oracle, ZDNet, business, news, Larry

Highlights:

• ZDNet > ZDNet News Page One > Business > Lane gets new job, blasts
Ellison

• Ray Lane, former No. 2 executive at Oracle Corp., hardly has a bad
thing to say about his former employer - except that it is a company full
of yes men who tend to be less than candid about their products.

• Coop's Corner: Larry Ellison and Basura-gate

From the web page of FIG. 1, it can be calculated that the useful portion
of the document represents 57 % of the contents of the web page (about 850
relevant words out of a total of 1500). The remaining 43 % of the words of the
document are links, comments, headers, footers, etc. Knowing that the
success rate of Extractor is approximately 80 %, only 57 % * 80 % of the
keyphrases extracted directly from a web site will be accurate, that is, about
45 %.
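
The estimate above can be checked with a few lines of arithmetic. The sketch below only restates the figures quoted in the text (850 relevant words, 1500 total words, an 80 % success rate); the variable names are illustrative and not part of the original description.

```python
# Figures quoted in the text; the variable names are illustrative.
relevant_words = 850
total_words = 1500
extractor_success_rate = 0.80

# Share of the page that belongs to the article itself.
useful_fraction = relevant_words / total_words

# Expected share of accurate keyphrases when the summarizer is run
# on the unfiltered page.
expected_accuracy = useful_fraction * extractor_success_rate

print(f"{useful_fraction:.0%} useful, {expected_accuracy:.0%} expected accuracy")
# prints "57% useful, 45% expected accuracy"
```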

Here are the keywords extracted by Extractor directly from the ZDNet
article shown in FIG. 1: Lane, Ellison, ZDNet, Oracle, business, news, Larry,
Tech, Shop, executives, Internet, blasts Ellison. The bolded keywords
(5 / 12 = 41 %) were extracted because of the 43 % of irrelevant words. The
extracted highlights are as follows:

• ZDNet: News: Lane gets new job, blasts Ellison

• Business>

• Former top lieutenant Ray Lane and Oracle CEO Larry Ellison continue
to battle, even as Lane takes a job with Kleiner Perkins.

• Ray Lane, former No. 2 executive at Oracle Corp., hardly has a bad
thing to say about his former employer -- except that it is a company full
of yes men who tend to be less than candid about their products.

Most news-related web pages and HTML-created emails contain frames
which are not relevant to the contents of the news article. These frames
contain links to related articles, to other web sites or publicity. This information
can be useful for the visitor of the web site but is irrelevant to the subject
discussed. Eliminating such frames is therefore useful for both extracting the
contents of the page and, eventually, summarizing this content. Most of the
time, these frames are placed in HTML tables. These tables help set the
display of the page and its semantics.

There is therefore a need for a text extractor which cleans superfluous
content from web pages, especially when this superfluous content is placed in
tables.



Summary of the Invention

Accordingly, a first object of the present invention is to extract only the 
relevant information from a document to facilitate the summarizing of the 
document. 

According to a first broad aspect of the present invention, there is
provided a method of extracting a portion of text from a document including at
least one table and cells within the at least one table, for the purposes of
generating a summary of contents of the document. The method comprises:

identifying cells within the document;

determining a text size of the cells;

selecting some of the cells using the text size of the cells;

extracting in a text only output a text content of the selected cells;

whereby the text only output extracted can be used to produce a
summary of a portion of text of the document excluding text from non-selected
cells.

According to a further aspect of the present invention, there is provided a
computer readable memory for storing programmable instructions for use in the
execution in a computer of the process of the method of extracting a portion of
text from a document.

According to still another aspect of the present invention, there is
provided a method of extracting a portion of text from a document including at
least one table and cells within the at least one table, for the purposes of
generating a summary of contents of the document. The method comprises the
steps of:

receiving a signal, the signal containing text extracted according to the
method of extracting a portion of text from a document.

According to a further aspect of the present invention, there is provided,
in a method of extracting a portion of text from a document including at least
one table and cells within the at least one table, for the purposes of generating
a summary of contents of the document, a computer data signal embodied in a
carrier wave comprising text extracted according to the method of extracting a
portion of text from a document.

According to another aspect of the present invention, there is provided a
system for extracting a portion of text from a document including at least one
table and cells within the at least one table, for the purposes of generating a
summary of contents of the document. The system comprises:

a cell identifier for identifying cells within the document;

a statistics calculator for determining a text size of the cells;

a cell selector for selecting some of the cells using the text size of the
cells;

a text extractor for extracting in a text only output a text content of the
selected cells;

whereby the text only output extracted can be used to produce a
summary of a portion of text of the document excluding text from non-selected
cells.

Brief Description of the Drawings

These and other features, aspects and advantages will become better
understood with regard to the following description and accompanying
drawings, wherein:

FIG. 1 is a screen shot of a news web page in which formatting tables
have been highlighted;

FIG. 2 is an illustration of the internal structure of a document;

FIG. 3 is a web page created using the source code of Table 3;

FIG. 4 is the resulting hierarchical tree structure of the web page document
of FIG. 3 using the algorithm of Table 2;

FIG. 5 is a flow chart of the method according to a preferred embodiment
of the present invention; and

FIG. 6 is a block diagram of a system according to a preferred
embodiment of the present invention.

Detailed Description of the Preferred Embodiment 

FIG. 1 shows a web page of news which contains many tables. Each
table has been framed to illustrate the number of tables and sub-tables used to
display and organize the contents of the web page. The web page shown was
available at www.zdnet.com/zdnn/stories/news/0,4586,2619342,00.html on
October 17, 2000. It contains a news article entitled "Lane gets new job, blasts
Ellison", written by Lee Gomes, published on August 24, 2000. As with many
news-related web sites, the page contains, in addition to the text of the article,
many additional links, images, ads and comments distributed around the core
content of the article.

FIG. 2 is the preferred internal structure used to work with the HTML
document which contains tables. It shows how using tables facilitates the
organization of the information and also how the body text of the page can be
buried in sub-tables of sub-tables. As is apparent from FIG. 2, each cell 46
belongs to one table 45, each table 45 has one or more cells 46, each cell 46
has one or more cell items 47, each cell item 47 belongs to one cell 46. A cell
item 47 can be text 48 or another table 49. This is the structure used by the
algorithm of the present invention to extract information.
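
The table / cell / cell item relationship described above can be sketched as a small recursive data structure. This is only an illustration under stated assumptions: the class names `Table` and `Cell` and the helper `nesting_depth` are hypothetical and do not come from the original description, which uses its own KTable, KCell and KCellItem types.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Cell:
    # A cell holds one or more cell items; each item is either plain text
    # or another (nested) table.
    items: List[Union[str, "Table"]] = field(default_factory=list)


@dataclass
class Table:
    # A table holds one or more cells.
    cells: List[Cell] = field(default_factory=list)


def nesting_depth(table: Table) -> int:
    """Return 1 for a table without sub-tables, 2 for one level of nesting, etc."""
    sub_tables = [
        item
        for cell in table.cells
        for item in cell.items
        if isinstance(item, Table)
    ]
    return 1 + max((nesting_depth(t) for t in sub_tables), default=0)
```

For example, a layout table whose single cell contains a navigation string and a nested table holding the article text has a nesting depth of 2, which is how body text ends up "buried" in sub-tables.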

The preferred embodiment of the present invention uses essentially two
main steps:

1) Document Structure Extraction and Accumulation of Statistics
on the Contents of the Document.
2) Tally of the Points and Generation of the Results.

Document Structure Extraction and Accumulation of Statistics on the Contents 
of the Document. 

The first step consists of reading the document object model (DOM) of a
document and transforming it into a representation of its internal structure (as
shown in FIG. 2) which is more user friendly at an algorithm level, at a
processing level and at a programming level. The DOM is received as a COM
object of type IHTMLDocument2 (MSHTML). The Document Object Model
(DOM) is a standard internal representation of the document structure and is
used to easily access components and delete, add or edit their content,
attributes and style. In essence, the DOM makes it possible for programmers to
write applications which work properly on all browsers and servers, and on all
platforms. While programmers may need to use different programming
languages, they do not need to change their programming model. The
Document Object Model is a platform- and language-neutral interface that
allows programs and scripts to dynamically access and update the content,
structure and style of documents. There are a plurality of versions of the DOM,
called levels. The first, the XML DOM, relies on an internal tree-like representation
of the document, and enables the hierarchy to be traversed accordingly. The
standard model of viewing a document is as a hierarchy of tags, with the
computer building up an internal model of the document based on a tree
structure. Meanwhile, the HTML DOM provides a set of convenient, easy-to-use
ways to manipulate HTML documents. The initial HTML DOM merely describes
methods for accessing, for example, an identifier by name or a particular link.
The HTML DOM is sometimes referred to as DOM Level 0 but has been
imported into DOM Level 1. The HTML and XML DOMs form part of DOM Level
1. DOM Level 2 includes DOM Level 1 but adds a number of new features.
IHTMLDocument2 is the implementation done by Microsoft of the HTML DOM
Level 2.

Once the structure of the DOM is represented in a user friendly format, it
is then possible to extract data useful for compiling statistics on the contents by
traveling through this hierarchical structure. Table 2 below is a simplified
version of the pseudo-code of the preferred embodiment of the present
invention which allows such an extraction.

Table 2. Document Structure Extraction and Accumulation of
Statistics on the Content

ExtractDocumentStructure( p_Document : IHTMLDocument2 ) : KTable
Begin
    KTable parsedDocument;

    // Extract Document Title
    //
    KCellItem pCellItem.Text( p_Document.get_title() );
    KCell pCell.AddCellItem( pCellItem );
    parsedDocument.AddCell( pCell );

    // Get a pointer to the body element.
    //
    IHTMLDOMNode pBodyNode = p_Document.get_body();

    // And parse the document.
    //
    KCell pBodyCell;
    RecursiveParse( pBodyNode, pBodyCell, false );
    parsedDocument.AddCell( pBodyCell );

    return parsedDocument;
End

RecursiveParse( p_Node : IHTMLDOMNode, p_Cell : KCell, p_bInHref : bool )
Begin
    // Iterate through all children.
    //
    IHTMLDOMNode pNodeCurrent = p_Node;
    while( pNodeCurrent )
    Begin
        if( pNodeCurrent == IHTMLDOMTextNode )
        Begin
            // It is a text only node.
            // Extract text and add it to current cell
            KCellItem pCellItem( pNodeCurrent.get_data() );
            p_Cell.AddCellItem( pCellItem );

            // Compute word stats.
            //
            integer nWords = CountWords( pCellItem );
            p_Cell.AddWords( nWords, p_bInHref );
        End
        else if( pNodeCurrent == IHTMLAnchorElement )
        Begin
            // If it is a <A HREF>, proceed with the children.
            if( pNodeCurrent.hasChildNodes() )
            Begin
                // We now are inside a Href.
                if( !p_bInHref )
                    p_Cell.AddLinks( 1 );
                IHTMLDOMNode pChild = pNodeCurrent.get_firstChild();
                RecursiveParse( pChild, p_Cell, true );
            End
        End
        else if( pNodeCurrent == IHTMLImageElement )
        Begin
            p_Cell.AddImages( 1 );
            KCellItem pCellItem( pNodeCurrent.get_alternateText() );

            // Compute word stats.
            //
            integer nWords = CountWords( pCellItem );
            p_Cell.AddWords( nWords, true );
        End
        else if( pNodeCurrent == IHTMLTable )
        Begin
            p_Cell.AddTables( 1 );

            // If it is a table, proceed with all table cells
            //
            KTable pSubTable;
            KCellItem pNewCellItem.Table( pSubTable );
            p_Cell.AddCellItem( pNewCellItem );

            // Retrieve column and row information.
            //
            pSubTable.Dimensions = GetTableDimensions( pNodeCurrent );

            // Retrieve table caption.
            //
            IHTMLDOMNode pCaption = pNodeCurrent.get_caption();
            RecursiveParse( pCaption, pSubTable.Caption, false );

            // Retrieve table summary.
            //
            IHTMLDOMNode pSummary = pNodeCurrent.get_summary();
            RecursiveParse( pSummary, pSubTable.Summary, false );

            // Extract content cell by cell
            //
            for( integer iRow=0; iRow < pSubTable.RowCount; iRow++ )
            Begin
                for( integer iCell=0; iCell < pSubTable.CellCount; iCell++ )
                Begin
                    IHTMLTableCell pCell = pNodeCurrent.get_cell( iRow, iCell );
                    KCell newCell;

                    // Extract content
                    //
                    RecursiveParse( pCell, newCell, false );
                    pSubTable.TableCell( iRow, iCell ) = newCell;
                End
            End
        End
        else
        Begin
            // Proceed with the children.
            //
            if( pNodeCurrent.hasChildNodes() )
            Begin
                IHTMLDOMNode pChild = pNodeCurrent.get_firstChild();
                RecursiveParse( pChild, p_Cell, p_bInHref );
            End
        End

        pNodeCurrent = pNodeCurrent.get_nextSibling();
    End
End

Although the previous algorithm only supports the DOM2 implementation
of Microsoft (the library MSHTML, which contains the objects IHTMLDocument2,
IHTMLDOMNode, IHTMLDOMTextNode, IHTMLTableElement, ...), it is to be
understood that it would be apparent to one skilled in the art to introduce code
for customers who do not have the DOM2 implementation of Microsoft.
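
As one illustration of such an alternative, the same bookkeeping can be approximated without MSHTML by using the event-based `html.parser` module of the Python standard library. This is a minimal sketch, not the implementation of the present description: the names `Cell`, `Table` and `StructureParser` are hypothetical, the markup is assumed to close its TD and TABLE tags explicitly, and captions, summaries and table dimensions are left out.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Cell:
    items: list = field(default_factory=list)  # text strings or Table objects
    words: int = 0
    words_in_links_or_images: int = 0
    links: int = 0
    images: int = 0
    tables: int = 0


@dataclass
class Table:
    cells: list = field(default_factory=list)


class StructureParser(HTMLParser):
    """Collects per-cell statistics while the HTML is being parsed."""

    def __init__(self):
        super().__init__()
        self.title = Cell()
        self.body = Cell()
        self.root = Table([self.title, self.body])
        self.cell_stack = [self.body]  # innermost open cell is last
        self.table_stack = []          # innermost open table is last
        self.in_title = False
        self.href_depth = 0

    def _add_text(self, cell, text, linked):
        n_words = len(text.split())
        if n_words == 0:
            return  # ignore whitespace-only runs
        cell.items.append(text.strip())
        cell.words += n_words
        if linked:
            cell.words_in_links_or_images += n_words

    def handle_starttag(self, tag, attrs):
        cell = self.cell_stack[-1]
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            if self.href_depth == 0:
                cell.links += 1
            self.href_depth += 1
        elif tag == "img":
            cell.images += 1
            # Alternate text is counted as "words in links or images".
            self._add_text(cell, dict(attrs).get("alt") or "", True)
        elif tag == "table":
            table = Table()
            cell.tables += 1
            cell.items.append(table)
            self.table_stack.append(table)
        elif tag in ("td", "th") and self.table_stack:
            new_cell = Cell()
            self.table_stack[-1].cells.append(new_cell)
            self.cell_stack.append(new_cell)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag == "a":
            self.href_depth = max(0, self.href_depth - 1)
        elif tag in ("td", "th") and len(self.cell_stack) > 1:
            self.cell_stack.pop()
        elif tag == "table" and self.table_stack:
            self.table_stack.pop()

    def handle_data(self, data):
        target = self.title if self.in_title else self.cell_stack[-1]
        self._add_text(target, data, self.href_depth > 0)
```

Feeding a page to `StructureParser().feed(...)` leaves the title cell and the body cell in `root`, with nested tables stored as items of the cell that contains them, which mirrors the structure of FIG. 2.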

Table 3 is an example of HTML source code used to display the web
page of FIG. 3. FIG. 3 is a web page created using the source code of Table 3.
It comprises introductory text 55, a hyperlink 56 in line 1, col. 1 of table 1, a text
entry in line 2, col. 1 of table 1, an image 59 and a text entry 58 at line 1, col. 2
of table 1 together with alternate text 60 and a table 62 within a cell 61 of a table
at line 2, col. 2 of table 1.



Table 3. Source code used to create the web page of FIG. 3

<HTML>
<HEAD>
<TITLE>Document Sample.</TITLE>
</HEAD>
<BODY>
First Text.
<TABLE border>
<TR>
<TD>
<A Href="www.copernic.com">Table 1, line 1, column 1</A>
</TD>
<TD>Table 1, line 1, column 2,
<IMG SRC="http://www.copernic.com/images/left-navbar/more-button.gif" ALT="Alternate Text">
</TD>
</TR>
<TR>
<TD>Table 1, line 2, column 1</TD>
<TD>Table 1, line 2, column 2
<TABLE border>
<TR><TD>Table 2, line 1, column 1</TD></TR>
</TABLE>
</TD>
</TR>
</TABLE>
</BODY>
</HTML>





FIG. 4 is an example of the hierarchical structure of the document
obtained using the pseudo-code of Table 2 on the web page of FIG. 3. The
whole web page is considered to form Table0 70. It has two rows and one
column, it doesn't have a caption or a summary and has a number KCell of
cells. Its title 70 is in a text string 72 equal to "Document Sample". The body of
the table 73 comprises cell items. The first cell item is a string of text 74
comprising "First Text." The second cell item is a table 75. Table 75 has 2 rows
and 2 columns 76. Table 75 has four items as follows: a text string 78 in cell 77,
a text string 80 and some alternate text 81 in cell 79, a text string 83 in cell 82
and a text string 85 together with another table 86 in cell 84. The table 86
comprises 1 row and 1 column and the only cell 88 comprises a text string 89.

Tally of the Points and Generation of the Results.

The generation of the results is preferably the following: 

1. Extract statistics (such as number of words, depth, etc.) from the whole
document;

2. Travel through all tables of the document and tally their points
(RankTable);

    2.1. If the number of points of a table is too low (LowThreshold),
    remove the table;

3. Sort the tables in order of number of points;

4. Identify the tables with the highest numbers of points (HiThreshold) and
save them in the GoodTables list;

5. Travel through the GoodTables list. For each sub-table of a table of the
GoodTables list:

    5.1. If its number of points is high enough (WinnerLowThreshold), the
    table is added to the GoodTables list;

6. Generate the results by travelling through all tables of the document;

    6.1. If the current table is in the GoodTables list, travel through all of its
    cells;

        6.1.1. Calculate the number of points of each cell (RankCell);

        6.1.2. If the number of points of the cell is sufficient
        (CellLowThreshold), extract the text from the cell.



Following is a table of the thresholds used during the tally of points:

Table 4. Preferred Thresholds used.

LowThreshold    HiThreshold    WinnerLowThreshold    CellLowThreshold
0.20            0.05           0.30                  0.50



Extracting Statistics from a Table (GetTableStatistics):

GetTableStatistics( p_Table : KTable ) : KStatistics
For all cells of the table

1 NumberOfWords = Calculate the total number of words in the table.

2 NumberOfWordsInLinksOrInImages = Calculate the number of words in
the links or the images.

3 NumberOfCells = Calculate the total number of cells.

4 WordsPerCell = (NumberOfWords - NumberOfWordsInLinksOrInImages)
/ NumberOfCells

It will be understood that the number of words calculation can be
modified to be a count of the number of characters or the number of bits, or can be
transformed to be a count of the number of sentences (by identifying an
uppercase letter followed by a plurality of characters and, eventually, a period)
or of the number of meaningful words (by removing occurrences of "the", "a", "an",
"but", "and", etc.). One could also choose to count cells only if they contain at least
one verb or at least one period.
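
The alternative measures of text size mentioned above can be illustrated with the following Python sketch. The function names, the regular expressions and the short stop-word list are assumptions made for the example. The sentence pattern in particular is a simple heuristic that would be misled by abbreviations containing periods.

```python
import re

# Only the example words named in the description; a real list would be longer.
STOP_WORDS = {"the", "a", "an", "but", "and"}


def count_words(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def count_meaningful_words(text: str) -> int:
    """Count words after removing the stop words listed above."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return sum(1 for w in words if w not in STOP_WORDS)


def count_sentences(text: str) -> int:
    """Count runs made of an uppercase letter, further characters and a period."""
    return len(re.findall(r"[A-Z][^.]*\.", text))
```

Any of these functions could stand in for the word count when the statistics of a table or a cell are computed.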






Calculating the Number of Points of a Table (RankTable): 



RankTable( p_Table : KTable, p_MainStats : KStatistics ) : float
Score = 0, Depth = 0
For all sub-tables of p_Table of depth Depth (0...n):

1. TableStats = Extract table statistics (GetTableStatistics)

2. DepthFactor = 1 / 2^Depth

3. LocalScore += DepthFactor * LinkDensityFactor * (1 -
TableStats.NumberOfWordsInLinksOrInImages /
TableStats.NumberOfWords)

4. LocalScore += DepthFactor * WordsPerCellFactor *
TableStats.WordsPerCell / p_MainStats.MaximumWordsPerCell

5. LocalScore += DepthFactor * WordCountFactor *
(TableStats.NumberOfWords -
TableStats.NumberOfWordsInLinksOrInImages) /
(p_MainStats.NumberOfWords -
p_MainStats.NumberOfWordsInLinksOrInImages)

6. Score = Score + LocalScore / (Number of tables of depth Depth)

The tally of points function uses a two-dimensional scale. The points are
calculated from the characteristics of the table and from all of the characteristics of
the items dependent on the table. The deeper a sub-table is in the
hierarchical tree structure of the page, the less it contributes to the final
number of points. All tables of a specified depth (Depth) contribute equally to the final
number of points. Following is a table of the scale used for the tally of
points.



Table 5. Scale Preferably Used to Tally the Points.

Depth   LinkDensityFactor              WordsPerCellFactor              WordCountFactor
0       0.33                           0.33                            0.33
1       (1/2^1) * 0.33 = 0.165         (1/2^1) * 0.33 = 0.165          (1/2^1) * 0.33 = 0.165
2       (1/2^2) * 0.33 = 0.0825        (1/2^2) * 0.33 = 0.0825         (1/2^2) * 0.33 = 0.0825
3       (1/2^3) * 0.33 = 0.04125       (1/2^3) * 0.33 = 0.04125        (1/2^3) * 0.33 = 0.04125
...
n       (1/2^n) * LinkDensityFactor    (1/2^n) * WordsPerCellFactor    (1/2^n) * WordCountFactor
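
The RankTable computation and the depth scale of Table 5 can be sketched in Python as follows, for illustration only. The Stats and Table classes and the helper names are hypothetical. The sketch assumes that LocalScore is reset for every sub-table, that the table being ranked is itself at depth 0, and that a term with a zero denominator contributes nothing. None of these points is stated explicitly in the pseudocode.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

LINK_DENSITY_FACTOR = 0.33
WORDS_PER_CELL_FACTOR = 0.33
WORD_COUNT_FACTOR = 0.33


@dataclass
class Stats:
    number_of_words: int
    words_in_links_or_images: int
    words_per_cell: float = 0.0
    maximum_words_per_cell: float = 0.0  # used for the whole document only


@dataclass
class Table:
    stats: Stats
    sub_tables: List["Table"] = field(default_factory=list)


def tables_by_depth(table: Table, depth: int = 0,
                    acc: Optional[Dict[int, List[Table]]] = None):
    """Group a table and all of its sub-tables by nesting depth."""
    acc = {} if acc is None else acc
    acc.setdefault(depth, []).append(table)
    for sub in table.sub_tables:
        tables_by_depth(sub, depth + 1, acc)
    return acc


def local_score(s: Stats, main: Stats, depth: int) -> float:
    depth_factor = 1 / 2 ** depth  # 1, 1/2, 1/4, ... as in Table 5
    body = s.number_of_words - s.words_in_links_or_images
    main_body = main.number_of_words - main.words_in_links_or_images
    score = 0.0
    if s.number_of_words:
        link_share = s.words_in_links_or_images / s.number_of_words
        score += depth_factor * LINK_DENSITY_FACTOR * (1 - link_share)
    if main.maximum_words_per_cell:
        ratio = s.words_per_cell / main.maximum_words_per_cell
        score += depth_factor * WORDS_PER_CELL_FACTOR * ratio
    if main_body:
        score += depth_factor * WORD_COUNT_FACTOR * body / main_body
    return score


def rank_table(table: Table, main: Stats) -> float:
    score = 0.0
    for depth, tables in tables_by_depth(table).items():
        for t in tables:
            # Every table of a given depth contributes equally.
            score += local_score(t.stats, main, depth) / len(tables)
    return score
```

With these factors, a table with no sub-tables that holds all of the body text of the document and no links scores close to 0.99, and an identical sub-table one level deeper would add half of that amount.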




The values of the parameters HiThreshold, WinnerLowThreshold, 
CellLowThreshold, LinkDensityFactor, WordsPerCellFactor and 
WordCountFactor are preferred values which have been obtained through 
experimentation. These values are independent of the properties of the
documents such as their size, their origin, etc. It would be possible to use other
values to obtain a suitable set of parameters for the extraction.

It should be understood that all counts done on contents of cells can be 
weighted by parameters to emphasize the importance of characteristics of the 
cells. It should therefore be understood that all additions, subtractions and
multiplications can be weighted by appropriate parameters.

Calculating the Number of Points of a Cell (RankCell): 

During the final pass for the generation of results, a last tally of points is
done at the cell level (RankCell). This tally of points is used to eliminate the
cells which contain too many links with respect to body text.

RankCell( p_Cell : KCell ) : float

Return (1 - p_Cell.NumberOfWordsInLinksOrInImages / p_Cell.NumberOfWords)
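
A minimal Python sketch of RankCell follows. The function signature is hypothetical. Returning 0.0 for a cell without any words is an assumption, as the case of an empty cell is not covered by the formula.

```python
def rank_cell(number_of_words: int, words_in_links_or_images: int) -> float:
    """Share of the words of a cell that are not in links or images."""
    if number_of_words == 0:
        return 0.0  # assumption: an empty cell is never selected
    return 1 - words_in_links_or_images / number_of_words
```

For example, a cell of 8 words of which 6 are in links obtains 0.25 and would be rejected by any CellLowThreshold above that value.
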
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FIG. 5 is a flow chart of the general methodology used in the previous 
algorithms. The cells in the document are identified 100; then a text size for
these cells is determined 101. Some cells are then selected using the text size
information 102. For the cells selected, the text content is extracted from the
cells 103. An optional step of summarizing the document using the content
extracted from the cells is then possible 104.

FIG. 6 is a block diagram of a system according to a preferred 
embodiment of the present invention. A document 110 with cells is provided. A 
cell identifier 111 identifies the cells within the document 110. A statistics
calculator 112 uses the document 110 to calculate statistics on at least some of
the cells of the document. A cell selector 113 uses the list of cells identified and
the statistics, together with the document, to select the cells relevant to the
contents of the document. A text extractor 114 uses the list of cells selected and
the document 110 to extract the text output 115.
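
The blocks of FIG. 6 can be loosely illustrated with the standard Python html.parser module, as sketched below. The class CellCollector and the function extract are hypothetical names. In this sketch the collector plays the role of the cell identifier and of the statistics calculator, while extract plays the role of the cell selector and of the text extractor. Only the cell-level score is applied, so the table-level tally is deliberately left out. Text is attributed to the innermost open cell, and the alternate text of images is counted but not emitted; both choices are assumptions.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the words and the link/image word count of each table cell."""

    def __init__(self):
        super().__init__()
        self.cells = []       # finished cells, in closing order
        self._open = []       # stack of open cells (handles nested tables)
        self._link_depth = 0  # greater than 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._open.append({"words": [], "total": 0, "in_links": 0})
        elif tag == "a":
            self._link_depth += 1
        elif tag == "img" and self._open:
            alt_words = (dict(attrs).get("alt") or "").split()
            self._open[-1]["total"] += len(alt_words)
            self._open[-1]["in_links"] += len(alt_words)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._open:
            self.cells.append(self._open.pop())
        elif tag == "a" and self._link_depth:
            self._link_depth -= 1

    def handle_data(self, data):
        if not self._open:
            return  # text outside any cell is ignored
        words = data.split()
        cell = self._open[-1]
        cell["words"].extend(words)
        cell["total"] += len(words)
        if self._link_depth:
            cell["in_links"] += len(words)


def extract(html: str, cell_low_threshold: float):
    """Return the text of each cell whose cell score reaches the threshold."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    selected = []
    for cell in collector.cells:
        if cell["total"] == 0:
            continue
        score = 1 - cell["in_links"] / cell["total"]
        if score >= cell_low_threshold:
            selected.append(" ".join(cell["words"]))
    return selected
```

Applied to a two-cell row holding a navigation bar made only of links and a paragraph with a single link, this sketch keeps only the paragraph.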

When the previous algorithms are used on the web page of FIG. 1, the
text extracted contains 860 words, which include 100 % (850 words) of the relevant
words contained in the news article portion of the web page document. The
extracted text is as follows in Table 6:

Table 6. Extracted text
Lane gets new job, blasts Ellison-
Former top lieutenant Ray Lane and Oracle CEO Larry Ellison continue
to battle, even as Lane takes a job with Kleiner Perkins.

By Lee Gomes, WSJ Interactive Edition-
August 24, 2000 7:51 AM PT- 
Ray Lane, former No. 2 executive at Oracle Corp., hardly has a bad
thing to say about his former employer except that it is a company 
full of yes men who tend to be less than candid about their products. 

Lane abruptly left the business-software giant in June after an
eight-year stint. One reason was that his responsibilities as president and
chief operating officer had been reduced by Lawrence Ellison, Oracle's
(Nasdaq: ORCL) chief executive. Lane, 53 years old, said following
his departure that he wanted to devote more time to his two young 
children by his second marriage. 

More stories on: Ellison vs. Lane 

Wednesday, Lane announced that he will become a general partner at 
Kleiner Perkins Caufield & Byers, the prominent Silicon Valley 
venture-capital firm. 

And in an interview scheduled with that announcement, Lane harshly
criticized Ellison, making clear that his departure from Oracle wasn't
amicable. In response to Lane's comments, Ellison strongly defended
himself and the company.
A great admirer yet-

Lane said he remains a great admirer of Oracle and Ellison. He said,
for example, that Ellison's oversight of the main Oracle database
product in the early 1990s "saved" the company, and that lately,
Ellison has "reinvigorated" Oracle to take advantage of the
opportunities presented by the Internet. That work made Lane's net
worth, based largely in Oracle stock, soar to nearly a billion
dollars.

But Lane also said that Ellison is utterly dominating the company 
right now, something that might prove to be harmful in the long run, 
since Oracle won't be able to develop the strong management team it 
needs.

'[The Oracle executives] aren't leaders. They just do what Larry says.
They wouldn't know how to make a decision without Larry making it for
them.'-

-- Ray Lane, former No. 2 executive at Oracle-

"It's just like with kids," Lane said. "If you make all their
decisions for them, they will go out as adults not knowing how to make
decisions themselves." The executives now reporting to Ellison, said
Lane, "are not decision makers. They aren't leaders. They just do what
Larry says. They wouldn't know how to make a decision without Larry
making it for them."

Lane came to Oracle, of Redwood Shores, Calif., in 1992 at a time when 
the company's credibility in the market was low. He said Wednesday 
that studies he commissioned at that time found that many customers
"would never do business again with a Larry Ellison company."

The reason, Lane said, is that Oracle would sell products it didn't 
have. "Larry is a visionary, and expresses the vision so well that 
people believe it's a product." When he first got to Oracle, Lane 
said, "managers would be willing to take the order and make a lot of 
money," even though the products often didn't exist. "That's the 
discipline I put into the company," he said. "I told the sales force, 
'After what Larry says is the vision, tell the customer the truth 
about what we can actually deliver.'"

'Needs more balance'-

Lane indicated that he is worried that with him gone, Oracle might 
lapse back to its old ways. "The company needs more balance," he said. 

Ellison rejected his former deputy's criticisms. 

Oracle's managers, Ellison said, were in many cases chosen by Lane 
himself. "He is criticizing his own team for being weak. When did they 
become yes men? I am thrilled they are all here. They are delivering 
exceptional results."

Ellison also said the company doesn't sell products it doesn't have. 

"He is the soul, the conscience of Oracle, and the other 45,000 of us 
are criminals?" Ellison asked. "It's astounding. We don't sell 
products that don't exist because it's against the law." 

Even while he was at Oracle, Lane was sometimes outspoken on the
subject of Ellison. Once, for example, he described how top executives
of Boeing Corp. were no longer dealing with Oracle about an important
"business-to-business" contract because they were angry that Ellison
had publicly stated, incorrectly, that Oracle had won the deal.

And his latest comments about Oracle should be viewed in the context
of his new job. At Kleiner Perkins, he will be helping start-up
companies in business-to-business software and services, some of which
may potentially compete with Oracle.

Lane said he was attracted to the venture-capital job in large part
because it will mean less travel. "When you are spending 70 percent of
your time on airplanes, you have to step back and say, 'Why am I doing
this?'" He also predicted a looming shakeout at many Internet
companies, which will make his sort of operational experience even
more valuable, since he will be able to provide guidance to the
surviving companies.

Lane was originally slated to stay on Oracle's board following his
departure. He said Wednesday, though, that he might leave it in the
fall, when his term expires.
See also: Business section- 
Enter a company- 

This extracted text can then be put through a summarizer of the prior art 
to obtain a relevant summary. For example, if the previous extracted text is put 
through the summarizer of CNRC, the following summary is obtained (which is 
fully relevant): 

Keyphrases: Lane, Oracle, Ellison, Larry, Executives, Business, Kleiner 
Perkins, Ray Lane, Vision, sell products, Managers, chief operating officer. 

Highlights: 

• Lane gets new job, blasts Ellison-Former top lieutenant Ray Lane and 
Oracle CEO Larry Ellison continue to battle, even as Lane takes a job 
with Kleiner Perkins. 

• The executives now reporting to Ellison, said Lane, "are not decision 
makers. 

• He said Wednesday that studies he commissioned at that time found that 
many customers "would never do business again with a Larry Ellison 
company." 

While the invention has been described in connection with specific 
embodiments thereof, it will be understood that it is capable of further 
modifications and this application is intended to cover any variations, uses, or
adaptations of the invention following, in general, the principles of the invention
and including such departures from the present disclosure as come within
known or customary practice within the art to which the invention pertains and
as may be applied to the essential features hereinbefore set forth, and as
follows in the scope of the appended claims.

CLAIMS

1. A method of extracting a portion of text from a document including a 
plurality of cells in at least one table, for the purposes of generating a summary 
of contents of said document, the method comprising: 

identifying cells within said document; 
determining a text size of said cells; 

selecting some of said cells using at least said text size of said cells;

extracting in a text only output a text content of said selected cells; 

whereby said text only output extracted can be used to produce a 
summary of a portion of text of said document excluding text from non-selected 
cells. 

2. A method as claimed in claim 1, wherein said step of determining a text size 
of said cells comprises determining a number of words contained in said cells 
and said step of selecting comprises ranking said cells using said number of 
words and selecting some of said cells having a highest rank. 

3. A method as claimed in any one of claims 1 and 2, wherein said identifying 
cells within said document comprises building a hierarchical tree structure for 
said document and said selecting some of said cells comprises using said 
hierarchical tree structure to determine a depth of said cells within said structure 
and selecting some of said cells having a large text size value and a low depth 
value. 

4. A method as claimed in any one of claims 1 to 3, wherein said determining a 
text size of said cells comprises calculating a number of hyperlinked words 
contained in said cells and subtracting said number of hyperlinked words from a 
total number of words contained in said cells to obtain a number of words of a 
text content of said cells and wherein said selecting comprises selecting some 
of said cells using said number of words of said text content of said cells. 

5. A method as claimed in any one of claims 1 to 5, wherein said determining a 
text size of said cells comprises calculating a number of words of an alternate 
text element contained in said cells and adding said number of words of said
alternate text element to a total number of words contained in said cells to obtain
a complete number of words of said cells and wherein said selecting comprises 
selecting some of said cells using said complete number of words of said cells. 

6. A method as claimed in any one of claims 1 to 5, wherein said selecting 
comprises 

calculating a rank for each of said cells as a function of said cell size and 
selecting cells with a highest rank. 

7. A method as claimed in any one of claims 1 to 6, wherein said identifying 
cells comprises 

identifying at least one table; and 

identifying at least one cell within each said at least one table. 

8. A method as claimed in claim 7, wherein said at least one cell within each 
said at least one table comprises at least one sub-table within said at least one 
cell. 

9. A method as claimed in any one of claims 7 and 8, wherein said determining 
a text size comprises: 

determining at least one of a number of words in said table, a number of 
words in links or images of said table, a number of cells in said table, a number 
of words per cell in said table, a depth of said table and a maximum number of 
words per cell; 

and wherein said selecting some of said cells comprises: 
calculating a score for said table; 

if said score is lower than a low threshold value, eliminating said table; 
if said score is higher than said low threshold value, selecting said table. 

10. A method as claimed in claim 9, wherein said selecting further comprises: 

if said score is higher than a high threshold value, selecting said table. 

11. A method as claimed in any one of claims 7 to 10, wherein, for each
sub-table included in a cell within said selected table, the method further comprises:

calculating a sub-score for each said sub-table;

if said sub-score is higher than a sub-table threshold value, selecting 
said sub-table to be a selected table. 

12. A method as claimed in any one of claims 1 to 6, wherein said determining a
text size of said cells comprises:

determining a number of words contained in said cells; and

determining a number of words in links or images of said cells

and wherein said selecting some of said cells using said text size of said cells
comprises:

calculating a cell score value for said cells using said number of words in
links or images and said number of words;

if said cell score value is higher than a cell threshold value, selecting said
cell.

13. A method as claimed in any one of claims 7 to 12, wherein said determining 
a text size of said cells further comprises: 

determining a number of words contained in each of said cells of said
selected table; and

determining a number of words in links or images of said
cells of said selected table;

and wherein said selecting some of said cells using said text size of said cells
further comprises:

calculating a cell score value for said cells of said selected table using
said number of words in links or images and said number of words;

if said cell score value is higher than a cell threshold value, selecting said
cell.

14. A computer readable memory for storing programmable instructions for use 
in the execution in a computer of the process of any one of claims 1 to 13.

15. A method of extracting a portion of text from a document including at
least one table and cells within said at least one table, for the purposes of 
generating a summary of contents of said document, comprising the steps of: 

receiving a signal, said signal containing text extracted according to the 
method as defined in any one of claims 1 to 13.

16. In a method of extracting a portion of text from a document including at 
least one table and cells within said at least one table, for the purposes of 
generating a summary of contents of said document, a computer data signal 
embodied in a carrier wave comprising: 

text extracted according to the method as defined in any one of claims 1 to 13.